arXiv: 1504.03156v1 [math.SP] 13 Apr 2015




Streaming, Memory Limited Matrix Completion with Noise 


Se-Young Yun 
Marc Lelarge 
Alexandre Proutiere 


SEYOUNG.YUN@INRIA.FR
MARC.LELARGE@ENS.FR
ALEPRO@KTH.SE


Abstract 


In this paper, we consider the streaming memory-limited matrix completion problem when the 
observed entries are noisy versions of a small random fraction of the original entries. We are 
interested in scenarios where the matrix size is very large so the matrix is very hard to store and 
manipulate. Here, columns of the observed matrix are presented sequentially and the goal is to 
complete the missing entries after one pass on the data with limited memory space and limited 
computational complexity. We propose a streaming algorithm which produces an estimate of the
original matrix with a vanishing mean square error, uses memory space scaling linearly with the
ambient dimension of the matrix, i.e. the memory required to store the output alone, and performs
a number of computations that scales with the number of non-zero entries of the input matrix.

Keywords: matrix completion, streaming input, limited memory, computational complexity 

1. Introduction 

Reconstructing a structured (e.g. low rank) matrix from noisy observations of a subset of its entries
constitutes a fundamental problem in collaborative filtering Rennie and Srebro (2005), and
has recently attracted much interest, see e.g. Candes and Recht (2009), Candes and Tao (2010),
Keshavan et al. (2010), Recht (2011). The recent development of matrix completion algorithms has
been largely motivated by the design of efficient recommendation systems. These systems (Amazon,
Netflix, Google) aim at proposing items or products from large catalogues to targeted users based on
the ratings that users provide for a small subset of items. This goal naturally translates to a matrix
completion problem where the rows (resp. the columns) of the matrix correspond to items (resp. to
users). Often, the (item, user) rating matrix is believed to exhibit a low rank structure due to the
inherent similarities among users and among items.

In this paper, we address the problem of matrix completion in scenarios where the matrix can
be extremely large, so that (i) it might become difficult to manipulate or even store, and (ii) the
complexity of the proposed algorithms should not rapidly increase with the matrix dimensions. In
other words, we aim at designing matrix completion algorithms under memory and computational
constraints. Memory-limited algorithms are particularly relevant in the streaming data model, where
observations (e.g. ratings in recommendation systems) are collected sequentially. We assume here
that the columns of the matrix are revealed one by one to the algorithm. More specifically, a subset
of noisy entries of an arriving column is observed, and may be stored, but the algorithm cannot
request these entries later if they were not stored. The streaming model seems particularly
appropriate to model recommendation systems, where users actually seek recommendations
sequentially. Recently, motivated by the need to understand high-dimensional data, several machine
learning techniques, such as PCA Mitliagkas et al. (2013) or low-rank matrix approximation
Clarkson and Woodruff (2009), have been revisited considering memory and computational
constraints. To our knowledge, this paper provides the first analysis of the matrix completion problem
under these constraints (refer to the related work section for a detailed description of the connection
of our problem to existing work).

Throughout the paper, we use the following notations. For any $m \times n$ matrix A, we denote by $A^\dagger$
its transpose. We also denote by $s_1(A) \ge \cdots \ge s_{n \wedge m}(A) \ge 0$ the singular values of A. The SVD
of matrix A is $A = U\Sigma V^\dagger$ where U and V are unitary matrices and $\Sigma = \mathrm{diag}(s_1(A), \ldots, s_{n \wedge m}(A))$.
$A^{-1}$ denotes the pseudo-inverse of A, i.e. $A^{-1} = V\Sigma^{-1}U^\dagger$. Finally, for any vector v, $\|v\|$
denotes its Euclidean norm, whereas for any matrix A, $\|A\|_F$ denotes its Frobenius norm, $\|A\|_2$ its
operator norm, and $\|A\|_\infty$ its $\ell_\infty$-norm, i.e., $\|A\|_\infty = \max_{i,j} |A_{ij}|$.

Contributions. Let $M \in [0,1]^{m \times n}$ denote the $m \times n$ ground-truth matrix we wish to recover
from noisy observations of some of its entries. M is assumed to exhibit a sparse structure (refer to
Assumption 1 (ii) for a formal definition); m and n are typically very large, and can be thought of as
tending to $\infty$. We assume that each entry of M is observed (but corrupted by noise) with probability
$\delta$ (independently over entries). The random set of observed entries is denoted by $\Omega$, and we
introduce the following operator from $\mathbb{R}^{m \times n}$ to itself: for all $Y \in \mathbb{R}^{m \times n}$,

\[ [\mathcal{P}_\Omega(Y)]_{ij} = \begin{cases} Y_{ij}, & \text{if } (i,j) \in \Omega, \\ 0, & \text{otherwise.} \end{cases} \]

Then, we wish to reconstruct M from the observed matrix $A = \mathcal{P}_\Omega(M + X)$, where X is a noise
matrix with independent and zero-mean entries, and such that $M_{ij} + X_{ij} \in [0,1]$. Note that $\delta$
typically depends on n and m, and tends to zero as n and m tend to infinity. Finally, we analyze the
matrix completion problem under the streaming model: we assume that in each round, a column of
A is observed. This column is uniformly distributed among the set of columns that have not been
observed so far.
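To make the observation model concrete, the following minimal NumPy sketch draws the set of observed entries and builds the observed matrix from a ground-truth matrix. The names `p_omega` and `observe`, the parameter `sigma`, and the particular bounded noise are assumptions made for illustration only; the model above only requires independent zero-mean noise that keeps every noisy entry in [0, 1].

```python
import numpy as np

def p_omega(Y, mask):
    """Sampling operator: keep Y[i, j] where mask[i, j] is True, put 0 elsewhere."""
    return np.where(mask, Y, 0.0)

def observe(M, delta, rng, sigma=0.5):
    """Return (A, mask) where A is the sampled noisy matrix and mask marks Omega.

    Assumed noise model (illustration only): X[i, j] is uniform on
    [-sigma * r, sigma * r] with r = min(M[i, j], 1 - M[i, j]) and 0 <= sigma <= 1,
    so X is zero-mean and every entry of M + X stays in [0, 1].
    """
    mask = rng.random(M.shape) < delta                  # each entry observed w.p. delta
    room = np.minimum(M, 1.0 - M)                       # distance to the nearest bound
    X = sigma * (2.0 * rng.random(M.shape) - 1.0) * room
    return p_omega(M + X, mask), mask

# Hypothetical usage on a small random matrix.
rng = np.random.default_rng(0)
M = rng.uniform(0.2, 0.8, size=(60, 50))
A, mask = observe(M, delta=0.3, rng=rng)
```

In a streaming setting only one column of the sampled matrix would be produced at a time; the dense matrices here are kept purely for readability.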

We present SMC (Streaming Matrix Completion), a memory-limited and low-complexity algorithm
which, based on the observed matrix A, constructs an estimator $\hat M$ of M. We prove, under
mild assumptions on M and the proportion $\delta$ of observed entries, that $\hat M$ is asymptotically accurate,
in the sense that its average mean-square error converges to 0 as both n and m grow large,
i.e., $\|\hat M - M\|_F^2 / (mn) = o(1)$. More precisely, we make the following assumption.

Assumption 1. (i) $\|M\|_F^2 = \Theta(mn)$.

(ii) (Structural sparsity of M) there exists $i < \min(n, m)$ such that $s_i(M)/s_{i+1}(M) = \omega(1)$ and
$\sum_{j=i+1}^{n \wedge m} s_j^2(M) = o(mn)$. We denote by k the smallest i satisfying this condition.


k log 2 m k logm 
n ’ m ’ m 


)), and 5 = o (-^—). 

n Mog ■ ! m J 


(iii) 6 = u(k max(^ 


The main result of this paper is a direct consequence of Theorems 5, 6, and 7. It states that under
Assumption 1, with high probability, the SMC algorithm provides an asymptotically accurate estimate
$\hat M$ of M using one pass on the observed matrix A, and requires $O(km + kn)$ memory space
and $O(\delta mnk)$ operations.

Note that Assumption 1 (ii) is satisfied as soon as M has low rank. More precisely, when
rank(M) = K, then (ii) is satisfied when k = K. In such a case, there is a non-empty set of
sampling rates $\delta$ for which SMC yields an asymptotically accurate estimate of M as soon as
$K = o(\ldots/\log(m)^{\ldots})$ (if, for example, m and n grow to infinity at the same pace).
Note also that $O(km + kn)$ is the dimension of the ambient space for M, i.e. M can be well
approximated by a rank-k matrix and hence $O(km + kn)$ is the minimum memory size required to
output a good estimate of M. Our algorithm SMC is optimal in the sense that it only requires the
amount of memory required to store the output.

The SMC algorithm consists of three main steps.

• Step 1. We first treat the $\ell = \frac{1}{\delta \log m}$ first arriving columns. These columns do not contain
enough information to learn the left singular vectors of M since there are many rows with
no observed entries. Instead, we can extract the top k right singular vectors for the submatrix
of M corresponding to the $\ell$ arriving columns. Let $A^{(B)}$ be the $\ell$ arriving columns and Q be
the top k right singular vectors extracted from $A^{(B)}$. After finding Q, we compute and
keep $W = A^{(B)}Q$ for the next step. W will be used to recover the top k right singular vectors
of M.

• Step 2. We extract the top k right singular vectors of M using W. We show that the linear
span of the columns of $\hat V = A^\dagger W$ is similar to the linear span of the top k right singular
vectors of M (Theorem 4). Although W is noisy, the matrix product amplifies the linear span
of the top k right singular vectors of M.

• Step 3. Once we know $\hat V$, it is easy to find column vectors $\hat U$ such that
$\|\hat U \hat V^\dagger - M\|_F^2 / (mn) = o(1)$. First, using the Gram-Schmidt process, we find $\hat R$ such that
$\hat V \hat R$ is an orthonormal matrix and compute $\hat U = \frac{1}{\hat\delta} A \hat V \hat R \hat R^\dagger$. Then,
$\hat U \hat V^\dagger = \frac{1}{\hat\delta} A \hat V \hat R (\hat V \hat R)^\dagger$, where $\hat V \hat R (\hat V \hat R)^\dagger$ is the projection
matrix onto the linear span of the top k right singular vectors of M. Therefore, $\hat U \hat V^\dagger$ becomes
very close to the best rank k approximation.


We show that these three steps can be realized in a memory-efficient manner, and using low 
complexity algorithms. 

Additional Notations. When matrices A and B have the same number of rows, $[A, B]$ denotes the
matrix whose first columns are those of A followed by those of B. For any matrix A, $A_\perp$ denotes
an orthonormal basis of the subspace perpendicular to the linear span of the columns of A. $A_i$, $A^j$,
and $A_{ij}$ denote the i-th column of A, the j-th row of A, and the (i, j) entry of A, respectively. For
$b \ge a$, $A^{a:b}$ and $A_{a:b}$ are submatrices of A respectively defined as $A^{a:b} = (A^j)_{j=a,\ldots,b}$ and
$A_{a:b} = (A_i)_{i=a,\ldots,b}$. Also, we will abbreviate $A^{1:k}_{1:k}$ to $A^{[k]}$. Finally, we define the following thresholding
operator for matrices. The operator is defined by two real positive numbers a and b, with $b \ge a$, and
if applied to A, it returns the matrix $|A|_a^b$ such that

\[ [\,|A|_a^b\,]_{ij} = \begin{cases} b & \text{if } A_{ij} \ge b, \\ A_{ij} & \text{if } a < A_{ij} < b, \\ a & \text{if } A_{ij} \le a. \end{cases} \]
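In code, this thresholding operator amounts to entry-wise clipping. A short NumPy sketch follows; the function name `threshold` is an assumption made for illustration.

```python
import numpy as np

def threshold(A, a, b):
    """Entry-wise thresholding: entries below a become a, entries above b become b."""
    if b < a:
        raise ValueError("threshold requires b >= a")
    return np.clip(A, a, b)
```

With a = 0 and b = 1 this maps any real matrix back into the range of admissible ratings, which is how the operator is used on the final estimate.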


2. Related Work 

This section surveys existing work on the design of matrix completion algorithms. We also provide a
description of recent work on rank-k approximation and PCA algorithms, as these algorithms could
be seen as building blocks of matrix completion methods. The section is organised as follows. We
first review algorithms for matrix completion. We then focus on streaming algorithms for rank-k
approximation, and PCA. Finally we discuss algorithms designed to be computationally efficient.

Matrix completion algorithms. Candes and Recht (2009) first showed that in absence of noise
(i.e., X = 0), the matrix M, with low rank k, can be recovered exactly using convex relaxation under
some conditions on the sampling rate $\delta$ and the singular vectors. These conditions were improved
in Candes and Tao (2010) and Recht (2011), and the approach was also extended to the case of
noisy observed entries Candes and Plan (2010). The proposed algorithms involve solving a convex
program, which can be computationally expensive. If the rank k of the matrix is known, M can be
recovered using simpler spectral methods. For example, in Keshavan et al. (2010), the authors show
that in absence of noise, M can be reconstructed asymptotically accurately using $O(\delta kmn \log n)$
operations under the conditions that the rank k does not depend on n and m, $\delta m = \omega(1)$ and
$\delta n = \omega(1)$. Again these results can be adapted to the presence of noise Keshavan et al. (2009).
In this paper, we improve the spectral method used in Keshavan et al. (2010) and Keshavan et al.
(2009), so that it becomes memory-efficient, and so that it has performance guarantees even if the
rank k of M scales with m and n.

Streaming algorithms. Clarkson and Woodruff (2009) propose an algorithm to provide a rank-k
approximation of a fully observed matrix A, using one pass on the columns of A. The algorithm
uses a random $m \times \ell$ Rademacher matrix S, with an appropriate choice of $\ell$, and outputs a rank-k
matrix $\hat A^{(k)}$ constructed from $A^\dagger S$ and $AA^\dagger S$. When setting $\ell = O(k\epsilon^{-1}\log(1/\eta))$, which requires
$O(k\epsilon^{-1}(m + n)\log(1/\eta))$ memory space, it is shown that with probability at least $1 - \eta$,

\[ \|A - \hat A^{(k)}\|_F \le (1 + \epsilon)\|A - A^{(k)}\|_F, \qquad (1) \]

where $A^{(k)}$ is the optimal rank-k approximation of A. We could think of applying this algorithm to
our problem. If the observed matrix A is $A = \mathcal{P}_\Omega(M + X)$, it would make sense to estimate M
by $\frac{1}{\delta}\hat A^{(k)}$ where $\hat A^{(k)}$ is the output of the algorithm in Clarkson and Woodruff (2009) applied to A.
Indeed, it is easy to check that $\|M - \frac{1}{\delta}A^{(k)}\|_F^2 = o(mn)$ (i.e., the optimal rank-k approximation
of $\frac{1}{\delta}A$ estimates M asymptotically accurately). However, in general, $\frac{1}{\delta}\hat A^{(k)}$ is not asymptotically
accurate. Indeed, when (1) holds with equality, the triangle inequality gives

\[ \frac{\|M - \frac{1}{\delta}\hat A^{(k)}\|_F^2}{mn} \ge \frac{\left(\|A - \hat A^{(k)}\|_F - \|A - A^{(k)}\|_F - \|A^{(k)} - \delta M\|_F\right)^2}{\delta^2 mn} = \frac{\left(\epsilon\|A - A^{(k)}\|_F - \|A^{(k)} - \delta M\|_F\right)^2}{\delta^2 mn}. \]

Now, one can also easily check that $\|A - A^{(k)}\|_F = \Omega(\sqrt{\delta mn})$ and $\|\delta M - A^{(k)}\|_F = o(\delta\sqrt{mn})$,
so that if we choose $\epsilon = \sqrt{\delta}$, we get $\|M - \frac{1}{\delta}\hat A^{(k)}\|_F^2 / (mn) = \Omega(1)$. As a consequence, using the
algorithm in Clarkson and Woodruff (2009), we cannot reconstruct M asymptotically accurately using
$O(k\sqrt{1/\delta}\,(m + n)\log(1/\eta))$ memory space. Recall that our algorithm reconstructs M accurately
with $O(k(m + n))$ memory space.

We could also think of using sketching and streaming PCA algorithms to reconstruct M. When
the columns arrive sequentially, these algorithms identify the left singular vectors in one pass on the
matrix. We would then need a second pass on the data to estimate the right singular vectors, and
complete the matrix. For example, Liberty (2013) proposes a sketching algorithm that updates the
$\ell$ most frequent directions when a new column of A is (fully) observed. This algorithm outputs a
sketch $\hat A$ of A and has the following performance guarantee: $\|AA^\dagger - \hat A\hat A^\dagger\|_2 \le 2\|A\|_F^2/\ell$. It also
uses $O(m\ell)$ memory space. Again if we apply the algorithm to our matrix completion problem,
i.e., to the observed matrix $A = \mathcal{P}_\Omega(M + X)$, where M is of rank k, then $\|A\|_F^2 = \Omega(\delta mn)$ and
$\sigma_k(AA^\dagger) = O(\delta^2\sigma_k^2(M)) = O(\delta^2 mn/k)$. Hence to efficiently extract the top k left singular vectors,
we would need that $\|A\|_F^2/\ell = o(\sigma_k(AA^\dagger))$, which implies $\ell = \omega(k/\delta)$. Therefore, the required
memory space would be $\omega(km/\delta + kn)$. Our algorithm is more efficient, and uses only one pass on the
matrix. Note that the streaming PCA algorithm proposed in Mitliagkas et al. (2013) does not apply
to our problem (in Mitliagkas et al. (2013), the authors consider the spiked covariance model where
each column is generated randomly in an i.i.d. manner).

Algorithm 1 Spectral PCA (SPCA)

Input: $A \in [0,1]^{m \times \ell}$, k

(Trimming) $\tilde A \leftarrow$ erase rows of A with more than $\max\{10, 10\delta\ell\}$ non-zero entries
$\Phi \leftarrow \tilde A^\dagger\tilde A - \mathrm{diag}(\tilde A^\dagger\tilde A)$
$\hat V_{1:k} \leftarrow \mathrm{QR}(\Phi, k)$

Output: $\hat V_{1:k}$


Algorithm 2 QR Algorithm

Input: $\Phi$ (of size $\ell \times \ell$), k

Initialization: $Q^{(0)} \leftarrow$ randomly choose k orthonormal vectors
for $\tau = 1$ to $\lceil 10\log(\ell) \rceil$ do
    $Q^{(\tau)}R^{(\tau)} \leftarrow$ QR decomposition of $\Phi Q^{(\tau-1)}$
end for
Output: $Q^{(\lceil 10\log(\ell) \rceil)}$

Low complexity algorithms. There has recently been an intense research effort to propose
low-complexity algorithms for various linear algebra problems. Randomization has appeared as an
efficient way to reduce the complexity of algorithms, see Halko et al. (2011) for a survey. For example,
Sarlos (2006) and Clarkson and Woodruff (2009) devise algorithms for rank-k approximation with
guarantees (1) and that use $O(\delta mn(k/\epsilon + k\log k) + n\,\mathrm{poly}(k/\epsilon))$ operations. When the input
matrix is sparse, Clarkson and Woodruff (2013) leverage sparse embedding techniques, and reduce
the required complexity to $O(\delta mn) + O((nk^2\epsilon^{-4} + k^3\epsilon^{-5}) \cdot \mathrm{polylog}(m + n))$ operations. But
once again, as explained above, these results do not apply to our framework ((1) is not enough to
guarantee an asymptotically accurate matrix completion).

3. Extracting Right-Singular Vectors 

As mentioned in the introduction, the SMC algorithm deals with batches of arriving columns.
Information from each batch will be extracted and aggregated as more columns arrive. In this section, we
present an algorithm that will be used as a building block for extracting information from a batch of
columns. For concreteness, let us assume that the size of a batch is $\ell$. In the SMC algorithm, $\ell$ will be
chosen much smaller than m, so as to guarantee that the algorithm does not require large memory
space.

The algorithm presented in this section addresses the following problem. Let $M \in [0,1]^{m \times \ell}$
with singular value decomposition $M = U\Sigma V^\dagger$. Given $0 < k < \ell$ and $A = \mathcal{P}_\Omega(M + X)$, we
wish to estimate the k dominant right-singular vectors of M, $V_{1:k}$. At first, this might appear as a
standard PCA task, but we are only interested in cases where A is very sparse. Indeed A only has
a vanishing proportion $\delta$ of non-zero entries. Note that on average, we have $\delta\ell$ observed entries per
row of M + X. Moreover, as will become clear in the design of the SMC algorithm, we need
to consider the case where $\delta\ell = o(1)$. In particular, there are many rows of A with no observed
entry. As a consequence, we do not get any information about the corresponding rows of U in the
singular value decomposition of M. Hence, we are here only interested in providing an estimate of
the right-singular vectors V.

The algorithm to extract the dominant right-singular vectors, referred to as SPCA (Spectral
Principal Component Analysis), is simple and its design relies on the following observation. If we
had access to the matrix M, then estimating the right-singular vectors of M would be straightforward.
Indeed $M^\dagger M = V\Sigma^2 V^\dagger$, so that a standard QR algorithm would output V. Now A constitutes a
subsampled noisy version of M and we could try to apply this algorithm directly to A. From basic
random matrix theory, we expect the eigenvalues associated with the signal (i.e., the subsampled
version of $M^\dagger M$) to be of the order of $\delta^2 s_k^2(M)$. On the other hand, the eigenvalues associated
with the noise (i.e., the subsampled version of $X^\dagger X$) should be of the order $\delta\sqrt{m\ell}$. Thus, one could
believe that the eigenvectors obtained by applying the QR algorithm to A provide a good estimate
of $V_{1:k}$ as soon as the ratio $\delta s_k^2(M)/\sqrt{m\ell}$ is large enough. However, this is not quite true, because of the
sparsity of the matrix A. To overcome this issue, we need to regularize the matrix A before applying
the QR algorithm. This is done in two steps:

(a) Trimming: The rows of the subsampled matrix A with too many non-zero entries are first
removed. This trimming step is standard and prevents rows with too many entries from perturbing
the spectral decomposition.

(b) Removing diagonal entries: Let $\tilde A$ denote the trimmed matrix. The diagonal entries of the
covariance matrix $\tilde A^\dagger\tilde A$ are then removed: $\Phi = \tilde A^\dagger\tilde A - \mathrm{diag}(\tilde A^\dagger\tilde A)$. This step is needed
because the diagonal entries of $\tilde A^\dagger\tilde A$ scale as $\delta$, whereas its off-diagonal entries scale as $\delta^2$.
Hence, when $\delta \to 0$, if the diagonal entries are not removed, they would be clearly dominant
in the spectral decomposition.
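The two regularization steps above, followed by repeated QR decompositions, can be sketched in NumPy as follows. This is a dense, illustrative sketch under stated assumptions and not the authors' implementation: the names `spca` and `qr_iteration`, passing the sampling rate `delta` as an explicit argument, the natural logarithm in the iteration count, the Gaussian random initialization, and the use of `numpy.linalg.qr` for each decomposition are all choices made here for illustration. A memory-efficient version would keep the trimmed matrix and the covariance matrix in sparse form.

```python
import numpy as np

def qr_iteration(Phi, k, rng, n_iter=None):
    """Orthogonal iteration: repeatedly multiply by Phi and re-orthonormalize."""
    ell = Phi.shape[0]
    if n_iter is None:
        n_iter = int(np.ceil(10 * np.log(ell)))          # assumed: natural logarithm
    Q, _ = np.linalg.qr(rng.standard_normal((ell, k)))   # k random orthonormal vectors
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Phi @ Q)
    return Q

def spca(A, k, delta, rng):
    """Estimate the k dominant right-singular directions from a sparse m x ell matrix A."""
    ell = A.shape[1]
    # (a) Trimming: drop rows with too many non-zero entries.
    row_counts = np.count_nonzero(A, axis=1)
    A_trim = A[row_counts <= max(10, 10 * delta * ell)]
    # (b) Remove the diagonal of the covariance matrix.
    Phi = A_trim.T @ A_trim
    np.fill_diagonal(Phi, 0.0)
    return qr_iteration(Phi, k, rng)
```

Dropping a row has the same effect on the covariance matrix as replacing it by a zero row, which is why the sketch simply filters the rows.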

In summary, the SPCA algorithm consists in applying the QR algorithm to the regularized version
of A, i.e., to $\Phi$. Its pseudo-code is presented in Algorithm 1. The following theorem provides
a performance analysis of SPCA, and is of independent interest.

Theorem 1 Let $\ell \le m$, $\ell = o(1/\delta)$, and $M \in [0,1]^{m \times \ell}$ with singular value decomposition
$M = U\Sigma V^\dagger$, where $\Sigma = \mathrm{diag}(s_1(M), \ldots, s_\ell(M))$ with $s_1(M) \ge \cdots \ge s_\ell(M) \ge 0$. Let
$A = \mathcal{P}_\Omega(M + X)$. Assume that there exists $k < \ell$ such that $s_k(M) = \omega(\sqrt{m})$,
$\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$, and $\frac{\delta s_k^2(M)}{\sqrt{m\ell\log\ell}} = \omega(1)$. Let $\hat V_{1:k}$ be the output of SPCA
with input A and k. Then we have $\|\hat V_{1:k}^\dagger \cdot (V_{1:k})_\perp\|_2 = o(1)$ with high probability.

Note that the condition $\frac{\delta s_k^2(M)}{\sqrt{m\ell\log\ell}} = \omega(1)$ in Theorem 1 is similar to that suggested by the random
matrix theory argument presented above. However we lose a log factor here because we use, in the
proof, the Matrix Bernstein inequality (Theorem 6.1 of Tropp (2012)). The condition $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$
ensures a good separation in the spectrum of M and is needed to ensure that the space spanned by
$V_{k+1:\ell}$ is nearly orthogonal to the space spanned by $\hat V_{1:k}$ by the Davis-Kahan $\sin\theta$ Theorem (Theorem
VII.3.2 in Bhatia (1997)). We conclude this section by analyzing the memory required by the SPCA
algorithm, and its computational complexity.

Required memory. SPCA needs to store $\tilde A$, $\Phi = \tilde A^\dagger\tilde A - \mathrm{diag}(\tilde A^\dagger\tilde A)$, and $\hat V$. The number of
non-zero entries of $\tilde A$ is $O(\delta m\ell)$, and for each entry we need to store its id and its value. Hence for
$\tilde A$, $O(\delta m\ell\log(m))$ memory is required. Similarly, the required memory for $\Phi$ is $O(\delta^2 m\ell^2\log(\ell))$.
Finally, storing $\hat V$ requires $O(\ell k)$ memory. Overall the required memory is $O(\delta m\ell\log(m) + \ell k)$.

Computational complexity. To run SPCA, we have to compute $\Phi$ and apply the QR algorithm to
$\Phi$. The computation of $\Phi$ requires $\ell(\ell - 1)/2$ inner products of columns of $\tilde A$. Each inner
product requires $O(\delta^2 m)$ floating-point operations, and thus the computational complexity to
compute $\Phi$ is $O(\delta^2 m\ell^2)$. Now in the QR algorithm, we compute $\Phi Q^{(\tau)}$ and run the QR decomposition
$O(\log(\ell))$ times. The matrix product $\Phi Q^{(\tau)}$ requires $O(\delta^2 m\ell^2 k)$ floating-point operations, while the QR
decomposition requires $O(\ell k^2)$ operations. Hence, the QR algorithm needs $O(\ell k(\delta^2 m\ell + k)\log(\ell))$
operations. Overall, the computational complexity of SPCA is $O(\ell k(\delta^2 m\ell + k)\log(\ell))$.

Algorithm 3 Streaming Matrix Completion (SMC)

Input: $\{A_1, \ldots, A_n\}$, k, $\ell$

1. $A^{(B)} \leftarrow [A_1, \ldots, A_\ell]$
2. $\hat\delta \leftarrow \frac{1}{m\ell}\sum_{(i,j)} \mathbb{1}([A^{(B)}]_{ij} > 0)$
3. $A^{(B1)}, A^{(B2)}, A^{(B3)}, A^{(B4)} \leftarrow \mathrm{Split}(A^{(B)}, 4, 4, \hat\delta)$
4. (PCA for the first block) $Q \leftarrow \mathrm{SPCA}(A^{(B1)}, k)$
5. (Trimming rows and columns)
   $A^{(B2)} \leftarrow$ set to zero the rows of $A^{(B2)}$ having more than two observed entries
   $A^{(B2)} \leftarrow$ set to zero the columns of $A^{(B2)}$ having more than $10m\hat\delta$ non-zero entries
6. (Reference columns) $W \leftarrow A^{(B2)}Q$
7. (Principal row vectors) $\hat V^{1:\ell} \leftarrow (A^{(B3)})^\dagger W$
8. (Principal column vectors) $\hat I \leftarrow A^{(B4)}\hat V^{1:\ell}$
Remove $A^{(B)}$, $A^{(B1)}$, $A^{(B2)}$, $A^{(B3)}$, $A^{(B4)}$, and Q from the memory space
for $t = \ell + 1$ to n do
   9. $A_t^{(1)}, A_t^{(2)} \leftarrow \mathrm{Split}(A_t, 2, 4, \hat\delta)$
   10. (Principal row vectors) $\hat V^t \leftarrow (A_t^{(1)})^\dagger W$
   11. (Principal column vectors) $\hat I \leftarrow \hat I + A_t^{(2)}\hat V^t$
   Remove $A_t$, $A_t^{(1)}$, and $A_t^{(2)}$ from the memory space
end for
12. $\hat R \leftarrow$ find $\hat R$ using the Gram-Schmidt process such that $\hat V\hat R$ is an orthonormal matrix
13. $\hat U \leftarrow \frac{4}{\hat\delta}\hat I\hat R\hat R^\dagger$ (the copies used to build $\hat I$ are sampled at rate $\delta/4$)

Matrix completion: $|\hat U\hat V^\dagger|_0^1$

4. Matrix completion with Streaming Input

Algorithm 4 Split

Input: A, a, b, $\delta$

Initial: $A^{(1)}, \ldots, A^{(a)} \leftarrow$ zero matrices having the same size as A

for every $[A]_{uv}$ do
    $s \leftarrow$ a subset of $\{1, \ldots, b\}$, randomly selected over all subsets of $\{1, \ldots, b\}$ with probability
    $\frac{1}{\delta}\left(\frac{\delta}{b}\right)^{|s|}\left(1 - \frac{\delta}{b}\right)^{b - |s|}$ if s is not the empty set and with probability
    $1 - \frac{1}{\delta}\left(1 - \left(1 - \frac{\delta}{b}\right)^b\right)$ if s is the empty set
    for $i \in s$ with $i \le a$ do
        $[A^{(i)}]_{uv} \leftarrow [A]_{uv}$
    end for
end for

Output: $A^{(1)}, \ldots, A^{(a)}$


In this section, we present our main algorithm, SMC, that reconstructs a matrix $M \in [0,1]^{m \times n}$
from a few noisy observations on its entries, i.e., from $A = \mathcal{P}_\Omega(M + X)$. The pseudo-code of
SMC is presented in Algorithm 3. SMC consists of three main steps: Step 1) Generate reference
columns denoted by W, Step 2) Find principal row vectors $\hat V$ using W, and Step 3) Find $\hat U$ such
that $\hat U\hat V^\dagger \approx M$. In what follows, we explain each of these steps in detail and show for each step
which conditions of Assumption 1 are needed. All proofs are presented in the Appendix. The singular
value decomposition of M is $M = U\Sigma V^\dagger$.


4.1. Step 1: Finding reference columns W 

We now explain the first step of the algorithm leading to an $m \times k$ matrix W containing reference
columns. This step corresponds to lines 1 to 6 in the pseudo-code.

Let $A^{(B)} = A_{1:\ell}$ be the batch of the $\ell$ first arriving columns of A. Note in particular that we
have:

\[ A^{(B)} = \mathcal{P}_\Omega(M^{(B)} + X_{1:\ell}) \quad \text{with} \quad M^{(B)} = M_{1:\ell} = U\Sigma(V^{1:\ell})^\dagger. \]

In line 2, we compute $\hat\delta$, an estimate of the sampling rate $\delta$. In line 3, we construct 4 undersampled
copies of $A^{(B)}$. For $i \in \{1, 2, 3, 4\}$, the different $A^{(Bi)}$'s are independent given M + X and have
the same distribution as $A^{(B)}$, except that the parameter $\delta$ is now replaced by $\delta/4$.
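The splitting step can be sketched in NumPy as follows. The sketch assumes that the loop of Algorithm 4 runs over the observed entries and that these are exactly the non-zero entries of its input; the function name `split`, the dense output matrices, and the two-stage sampling (first the size of the subset, then a uniformly random subset of that size) are illustrative choices, not the authors' code. The two-stage sampling reproduces the subset probabilities of Algorithm 4 because those probabilities depend on the subset only through its size.

```python
import numpy as np
from math import comb

def split(A, a, b, delta, rng):
    """Return a undersampled copies of A; each observed entry lands in copy i w.p. 1/b."""
    if not 1 <= a <= b:
        raise ValueError("split requires 1 <= a <= b")
    rows, cols = np.nonzero(A)          # observed entries (assumed to be the non-zero ones)
    x = delta / b
    # p[j] = probability that the selected subset s has size j.
    p = np.array([comb(b, j) * x**j * (1.0 - x)**(b - j) for j in range(b + 1)]) / delta
    p[0] = max(0.0, 1.0 - p[1:].sum())  # probability of the empty subset
    p /= p.sum()                        # guard against rounding errors
    size = rng.choice(b + 1, size=rows.size, p=p)
    # Row-wise random permutations; label i belongs to s when its value is below |s|.
    ranks = np.argsort(rng.random((rows.size, b)), axis=1)
    member = ranks[:, :a] < size[:, None]
    copies = []
    for i in range(a):
        C = np.zeros(A.shape, dtype=float)
        sel = member[:, i]
        C[rows[sel], cols[sel]] = A[rows[sel], cols[sel]]
        copies.append(C)
    return copies
```

Since an entry of A is observed with probability $\delta$ and is then kept in a given copy with probability $1/b$, each copy has sampling rate $\delta/b$, which is $\delta/4$ for the calls made in Algorithm 3.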

The first non-trivial operation is presented in line 4 where we apply the algorithm SPCA described
in the previous section to the matrix $A^{(B1)}$. In order to apply Theorem 1, we need to have:

\[ \frac{s_k(M^{(B)})}{s_{k+1}(M^{(B)})} = \omega(1) \quad \text{and} \quad \frac{\delta s_k^2(M^{(B)})}{\sqrt{m\ell\log\ell}} = \omega(1). \qquad (2) \]

Note that there is a slight abuse of notation as the distribution of $A^{(B1)}$ is the same as the one of
$A^{(B)}$ if we change $\delta$ to $\delta/4$, but a constant factor 4 is clearly irrelevant here. Our first task is to
translate the conditions (2) into conditions on the original matrix M. To this aim, we state the following lemma:

Lemma 2 Let $M = U\Sigma V^\dagger$ be an $m \times n$ matrix and $\ell \le n$. Denote $M^{(B)} = M_{1:\ell}$. If
$s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$ and $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$, then with high probability,

\[ s_k(U_{1:k}U_{1:k}^\dagger M^{(B)}) = \Omega\left(\sqrt{\ell/n}\; s_k(M)\right) \quad \text{and} \quad \frac{s_k(U_{1:k}U_{1:k}^\dagger M^{(B)})}{s_1((I - U_{1:k}U_{1:k}^\dagger)M^{(B)})} = \omega(1). \]

Its proof is given in Appendix A.2 and follows from the matrix Chernoff bound (Theorem 2.2 of
Tropp (2011)).

Note that $U_{1:k}U_{1:k}^\dagger$ is the orthogonal projection on the span of $U_{1:k}$. As a result, we have
$s_k(M^{(B)}) \ge s_k(U_{1:k}U_{1:k}^\dagger M^{(B)})$ by a simple application of the Courant-Fischer variational
formulas for singular values. In particular, as soon as $\frac{\ell\delta^2 s_k^4(M)}{mn^2\log\ell} \to \infty$, we see that the second condition
in (2) is satisfied. To get the first condition in (2), we write:

\[ M^{(B)} = U_{1:k}U_{1:k}^\dagger M^{(B)} + (I - U_{1:k}U_{1:k}^\dagger)M^{(B)}. \]

The first matrix is of rank k, so its (k+1)-th singular value vanishes, and we can use Lidskii's inequality
$s_{k+1}(A + B) \le s_{k+1}(A) + s_1(B)$ to get:

\[ s_{k+1}(M^{(B)}) \le s_1((I - U_{1:k}U_{1:k}^\dagger)M^{(B)}). \]

Hence we have

\[ \frac{s_k(M^{(B)})}{s_{k+1}(M^{(B)})} \ge \frac{s_k(U_{1:k}U_{1:k}^\dagger M^{(B)})}{s_1((I - U_{1:k}U_{1:k}^\dagger)M^{(B)})}, \]

and the first condition in (2) follows from the second statement in Lemma 2 as soon as its conditions
are satisfied. Combined with Lemma 2, Theorem 1 allows us to get the properties of Q computed
in line 4 of the Algorithm SMC:

Corollary 3 Assume that there exists k and $\ell$ such that $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$, $\frac{\ell\delta^2 s_k^4(M)}{mn^2\log\ell} = \omega(1)$, and
$s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$. Let $\bar V^{1:\ell}_{1:k}$ be an orthonormal basis of the linear span of $V^{1:\ell}_{1:k}$. Then we have
$\|(\bar V^{1:\ell}_{1:k})^\dagger \cdot Q_\perp\|_2 = o(1)$ with high probability, where Q is the $\ell \times k$ matrix obtained in line 4 of the
Algorithm SMC.

Once we have Q, we compute what we call the reference columns as follows:

\[ W = A^{(B2)} \cdot Q. \]

Note that W will be kept in memory during the whole algorithm. It is relatively easy to see that
the linear span of the columns of W is a noisy version of the linear span of $U_{1:k}$. Indeed, note that
$E[A^{(B2)}] = \frac{\delta}{4}M^{(B)}$; moreover we have $M^{(B)} = U\Sigma(V^{1:\ell})^\dagger \approx U_{1:k}\Sigma^{[k]}(V^{1:\ell}_{1:k})^\dagger$ thanks to Lemma
2. Hence we have

\[ W = A^{(B2)}Q \approx \frac{\delta}{4}U_{1:k}\Sigma^{[k]}(V^{1:\ell}_{1:k})^\dagger Q. \]

By Corollary 3, the span of the columns of Q is approximately the span of the columns of $V^{1:\ell}_{1:k}$, so that
the singular values associated with the linear span of $U_{1:k}$ are $\Omega(\delta\sqrt{\ell/n}\; s_k(M))$ by
Lemma 2. This value has to be compared to the noise level. For the same reason as in Section 3, we
first trim the matrix $A^{(B2)}$ (note that the first trimming phase in line 5 is made to ease the technical
proof). After the trimming process, the singular values of $(A^{(B2)} - E[A^{(B2)}]) \cdot Q$ are bounded by
$O(\sqrt{\delta m\ell})$. Unfortunately, in our setting this can be much larger than $\delta\sqrt{\ell/n}\; s_k(M)$. However, the
hidden signal in W is in the span of the columns of $U_{1:k}$ and all the columns that arrive belong
(approximately) to this span. In the sequel, we use this fact in order to amplify the signal in W
when estimating V and then U.

4.2. Step 2: Finding principal row vectors $\hat V$

In this section, we explain how we recover $V_{1:k}$, or at least k vectors having the same linear span as
$V_{1:k}$.

Let $A^{(1)} = [A^{(B3)}, A^{(1)}_{\ell+1}, \ldots, A^{(1)}_n]$. Note that thanks to the splitting procedure in line 9, the
columns of $A^{(1)}$ are i.i.d. with sampling rate $\delta/4$. In the SMC algorithm, we simply get an estimate
of V as follows: $\hat V = (A^{(1)})^\dagger W$. The linear span of the columns of $\hat V$ becomes very close to the
linear span of the columns of $V_{1:k}$ when

\[ \frac{s_k(V_{1:k}V_{1:k}^\dagger\hat V)}{s_1((I - V_{1:k}V_{1:k}^\dagger)\hat V)} = \omega(1). \]

This can be seen as in Section 4.1 since $V_{1:k}V_{1:k}^\dagger$ is simply the orthogonal projection on the span of $V_{1:k}$.
The above condition holds for the following reasons:

• The signal is amplified (Lemma 12 in Appendix). Since $E[A^{(1)}] = \frac{\delta}{4}M$, we see that

\[ \hat V = (A^{(1)})^\dagger W \approx \frac{\delta^2}{16}M^\dagger U_{1:k}\Sigma^{[k]}(V^{1:\ell}_{1:k})^\dagger Q. \]

Roughly, the signal which was $\Omega(\delta\sqrt{\ell/n}\; s_k(M))$ is now multiplied by $\delta s_k(M)$ and we get:

\[ s_k(V_{1:k}V_{1:k}^\dagger\hat V) = \Omega\left(s_k\left(V_{1:k}V_{1:k}^\dagger(E[A^{(1)}])^\dagger E[A^{(B2)}]Q\right)\right) = \Omega\left(\delta^2 s_k^2(M)\sqrt{\ell/n}\right). \]

• The noise is cancelled (Lemma 13 in Appendix). Since the two noise matrices $A^{(B2)} - E[A^{(B2)}]$
and $A^{(1)} - E[A^{(1)}]$ are independent, the noise directions are not amplified as much
as the signals. We can bound the noise as follows:

\[ s_1((I - V_{1:k}V_{1:k}^\dagger)\hat V) = o\left(\delta^2 s_k^2(M)\sqrt{\ell/n}\right). \]

Putting things together, we obtain the following result:

Theorem 4 Assume that there exists k and $\ell$ such that $s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$, $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$,
and $\frac{\ell\delta^2 s_k^4(M)}{mn^2(k + \log\ell)} = \omega(1)$. Then we have $\|V_{1:k}^\dagger \cdot \hat V_\perp\|_2 = o(1)$ with high probability.

4.3. Step 3: Finding principal column vectors $\hat U$

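In the previous step, we identified an $n \times k$ matrix $\hat V$ estimating the principal row vectors of M.
From this estimate, we now extract the matrix $\hat U$ such that $\|\hat U\hat V^\dagger - M\|_F^2 = o(mn)$.

Let $A^{(2)} = [A^{(B4)}, A^{(2)}_{\ell+1}, \ldots, A^{(2)}_n]$. For simplicity, suppose that the linear span of the rows
of $\hat V^\dagger$ is exactly the same as the linear span of the rows of M. From $\hat V$, we can generate a $k \times k$
matrix $\hat R$ using the Gram-Schmidt process so that $\hat V\hat R$ has orthonormal columns. Since the columns
of $\hat V\hat R$ form an orthonormal basis of the linear span of the rows of M, and since the entries of $A^{(2)}$
are sampled at rate $\delta/4$ so that $E[A^{(2)}] = \frac{\delta}{4}M$, we have

\[ M = \frac{4}{\delta}E[A^{(2)}]\hat V\hat R(\hat V\hat R)^\dagger = \left(\frac{4}{\delta}E[A^{(2)}]\hat V\hat R\hat R^\dagger\right) \cdot \hat V^\dagger = \bar U\hat V^\dagger, \]

where $\bar U = \frac{4}{\delta}E[A^{(2)}]\hat V\hat R\hat R^\dagger$.

From the above observation, we propose to compute $\hat U$ as follows:

\[ \hat U = \frac{4}{\hat\delta}\hat I\hat R\hat R^\dagger = \frac{4}{\hat\delta}A^{(2)}\hat V\hat R\hat R^\dagger = \frac{4}{\hat\delta}E[A^{(2)}]\hat V\hat R\hat R^\dagger + \frac{4}{\hat\delta}(A^{(2)} - E[A^{(2)}])\hat V\hat R\hat R^\dagger. \]

Since $\hat\delta$ is an estimate of $\delta$, the first term is close to $\bar U$. Then, we need to prove that the row space
of $A^{(2)} - E[A^{(2)}]$ is almost orthogonal to $\hat V$, to get $\hat U = (1 + o(1))\bar U$. This is true only if n is
large enough, indeed $n = \omega(k/\delta)$ (see Appendix).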
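Step 3 can be sketched in NumPy as follows, under stated assumptions. The function name `complete` and the parameter `rate` (the sampling rate of the matrix that was multiplied by the estimated row vectors to accumulate `I_hat`) are hypothetical, and a QR decomposition from `numpy.linalg.qr` stands in for the Gram-Schmidt process: if the estimated row-vector matrix factors as an orthonormal matrix times an upper-triangular matrix T, then the inverse of T plays the role of the Gram-Schmidt matrix.

```python
import numpy as np

def complete(I_hat, V_hat, rate):
    """Compute the column factor and the thresholded completion from I_hat and V_hat."""
    Qv, T = np.linalg.qr(V_hat)          # V_hat = Qv @ T, Qv has orthonormal columns
    R = np.linalg.inv(T)                 # hence V_hat @ R = Qv
    U_hat = (I_hat @ R @ R.T) / rate     # rescale by the sampling rate behind I_hat
    M_hat = np.clip(U_hat @ V_hat.T, 0.0, 1.0)
    return U_hat, M_hat
```

Only the product of the Gram-Schmidt matrix with its transpose enters the estimate, and that product equals the inverse of the Gram matrix of the estimated row vectors, so the sign conventions of the QR routine do not affect the result. The sketch assumes the estimated row-vector matrix has full column rank.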

We are now ready to analyze the performance of the SMC algorithm. We first need to check
that Assumption 1 implies the technical conditions required in our previous results. When M has
k dominant singular values such that $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$ and $\sum_{i>k} s_i^2(M) = o(mn)$, then
$s_k^2(M) = \Omega(mn/k)$. To see this, assume this is not the case, so that there exists $k' < k$ such that
$s_{k'}^2(M) = \Omega(mn/k)$ and $s_{k'+1}^2(M) = o(mn/k)$. But then $\frac{s_{k'}(M)}{s_{k'+1}(M)} = \omega(1)$ and
$\sum_{i>k'} s_i^2(M) = o(mn)$, which contradicts the minimality of k. Therefore, the conditions
$s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$ and $\frac{\ell\delta^2 s_k^4(M)}{mn^2(k + \log\ell)} = \omega(1)$ become
$\ell = \omega(k\log m)$ and $\frac{\delta^2\ell m}{k^2(k + \log\ell)} = \omega(1)$, which are satisfied by Assumption 1 when $\ell = O(m)$ and
$\ell = \frac{1}{\delta\log m}$. Hence we obtain the following result:

Theorem 5 Assume that Assumption 1 is satisfied with $\ell = \frac{1}{\delta\log m}$ and $\ell = O(m)$. Then with
high probability, the SMC algorithm provides an asymptotically accurate estimate of M:

\[ \frac{\|M - |\hat U\hat V^\dagger|_0^1\|_F^2}{mn} = o(1). \]

4.4. Required Memory 

Next we analyze the memory required by the SMC algorithm. 

From line 1 to 8 in the pseudo-code. We need to store $A^{(B)}$, $A^{(B1)}$, $A^{(B2)}$, $A^{(B3)}$, and $A^{(B4)}$.
Since these matrices are sparse with sampling rate $\delta$ or $\delta/4$, we need to store only $O(\delta m\ell)$ of their
elements and $O(\delta m\ell\log m)$ bits to store the ids of the non-zero entries. From the previous section,
we know that the SPCA algorithm requires $O(\delta m\ell\log m + k\ell)$ memory to find Q. Finally we need
to store $\hat V$ and $\hat I$. Thus, when $\ell = \frac{1}{\delta\log m}$, this first part of the algorithm requires $O(km + kn)$ memory.
From line 9 to 11. Here we treat the remaining columns. Note that before doing that, $A^{(B)}$, $A^{(B1)}$,
$A^{(B2)}$, $A^{(B3)}$, $A^{(B4)}$, and Q are removed from the memory. Using this memory, for the t-th arriving
column, we can store it, compute $\hat V^t$ and $\hat I$, and remove the column to save memory. Therefore, we
do not need additional memory to treat the remaining columns.
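The per-column update of lines 9 to 11 can be sketched as follows, assuming the two split copies of the arriving column are already available as dense one-dimensional arrays. The function name `process_column` and the dense representation are assumptions made for illustration; only the non-zero rows of each copy are touched, so the work per column is proportional to k times its number of observed entries.

```python
import numpy as np

def process_column(col1, col2, W, I_hat):
    """Process one arriving column: return row t of V_hat and update I_hat in place."""
    idx1 = np.flatnonzero(col1)
    v_t = col1[idx1] @ W[idx1]                   # (first copy)^T W, a vector of length k
    idx2 = np.flatnonzero(col2)
    I_hat[idx2] += np.outer(col2[idx2], v_t)     # rank-one update restricted to observed rows
    return v_t
```

After the call, the column and its copies can be discarded: only the reference columns, the accumulated m x k matrix, and the rows of the estimated row-vector matrix computed so far are kept.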

Lines 12 and 13. From $\hat I$ and $\hat V$, we compute $\hat U$. To this aim, the memory required is $O(km + kn)$.
In summary, we have:

Theorem 6 When $\ell = \frac{1}{\delta\log m}$, the memory required to run the SMC algorithm is $O(km + kn)$.

4.5. Computational Complexity 

The computational complexity of the SMC (Algorithm 3) depends on the number of non-zero elements of $A$ and $\ell$. More precisely:

From line 1 to 8. From the previous section, the SPCA algorithm requires $O(\ell k(\delta^2 m\ell + k)\log(\ell))$ floating-point operations to compute $Q$. The computations of $W$, $\hat V$, and $\hat I$ are just inner products, and require $O(\ell k(\delta^2 m\ell + k)\log(\ell))$ operations.

From line 9 to 11. To compute $\hat V^t$ and $\hat I$ when the $t$-th column arrives, we need $O(km\delta)$ operations. Since there are $n - \ell$ remaining columns, the total number of operations is $O(kmn\delta)$.
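
The per-column cost can be illustrated with a small sketch of one streaming step. It assumes, as a simplification, that an arriving column is given as parallel lists of observed row indices and values, that its observed entries are split at random into two halves, that the first half is multiplied with a stored $m \times k$ matrix `W` to form a new row of $\hat V$, and that the second half updates $\hat I$ by a rank-one term. The name `process_column`, the data layout, and the 50/50 split are assumptions made for illustration, not the paper's pseudo-code.

```python
import random


def process_column(rows, vals, W, I_hat, rng):
    """Handle one arriving sparse column given as parallel lists (rows, vals).

    W and I_hat are m x k lists of lists. I_hat is updated in place and the
    k-dimensional row v_t (the new row of V_hat) is returned.
    """
    k = len(W[0])
    half1, half2 = [], []
    for r, x in zip(rows, vals):
        # Assumed split: each observed entry goes to one of two halves.
        (half1 if rng.random() < 0.5 else half2).append((r, x))
    # v_t = (first half of the column)^T W, touching only observed rows.
    v_t = [0.0] * k
    for r, x in half1:
        for j in range(k):
            v_t[j] += x * W[r][j]
    # I_hat += (second half of the column) v_t^T, again only on observed rows.
    for r, x in half2:
        for j in range(k):
            I_hat[r][j] += x * v_t[j]
    return v_t


# Tiny usage example with hypothetical data.
rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I_hat = [[0.0, 0.0] for _ in range(3)]
v_t = process_column([0, 2], [0.5, 1.0], W, I_hat, rng)
```

Both loops touch only the observed entries of the column, so the work is proportional to $k$ times the number of observed entries, which is about $km\delta$ in expectation. The column can be discarded once the function returns.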

Lines 12 and 13. $\hat R$ is computed from $\hat V$ using the Gram-Schmidt process, which requires $O(k^2 m)$ operations. We then compute $\hat I\hat R\hat R^\dagger$ using $O(k^2 m)$ operations.
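
The Gram-Schmidt step can be sketched as follows: given a matrix with $k$ linearly independent columns, it returns a $k \times k$ matrix $R$ such that the product of the input with $R$ has orthonormal columns. This is a plain modified Gram-Schmidt written for illustration; the function name `gram_schmidt_R` and the list-of-rows layout are assumptions.

```python
import math


def gram_schmidt_R(V):
    """Return (Q, R) with Q = V R, where Q has orthonormal columns.

    V is a list of rows, each of length k, with linearly independent columns.
    Q is returned as a list of k column vectors, R as a k x k list of lists.
    """
    n, k = len(V), len(V[0])
    cols = [[V[r][j] for r in range(n)] for j in range(k)]  # columns of V
    Q = []                                  # orthonormal columns found so far
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        q = cols[j][:]
        coef = [0.0] * k
        coef[j] = 1.0                       # q currently equals V e_j
        for i in range(j):
            c = sum(a * b for a, b in zip(Q[i], q))  # projection coefficient
            q = [a - c * b for a, b in zip(q, Q[i])]
            for t in range(k):
                coef[t] -= c * R[t][i]      # subtract c times column i of R
        norm = math.sqrt(sum(a * a for a in q))
        Q.append([a / norm for a in q])
        for t in range(k):
            R[t][j] = coef[t] / norm
    return Q, R


V = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
Q, R = gram_schmidt_R(V)
```

The inner products dominate the cost: each of the $k$ columns is compared with at most $k$ earlier ones, and every inner product is linear in the number of rows.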

When £ = and k 2 = 0(6n), the number of operations to treat the first £ columns is 

$O(\ell k(\delta^2 m\ell + k)\log(\ell)) = O(k\delta^2 m\ell^2\log(\ell)) + O(\ell k^2\log(\ell))$

= 0(k 3 m 1°!^ ) + O(Smn) = 0{kmnS). 
log m 

Since the remaining part of the algorithm requires $O(\delta kmn)$ operations as well, we conclude Theorem 7.

Theorem 7 Assume that Assumption 1 is satisfied with $\ell =$ . Then, the computational complexity of the SMC algorithm is $O(\delta kmn)$.

5. Conclusion 

This paper investigated the streaming memory-limited matrix completion problem when the observed entries are noisy versions of a small random fraction of the original entries. We proposed
a streaming algorithm which produces an estimate of the original matrix with a vanishing mean 
square error, uses memory space scaling linearly with the ambient dimension of the matrix, i.e. the 
memory required to store the output alone, and spends computations as much as the number of 
non-zero entries of the input matrix. Our algorithm is relatively simple, and in particular, it does not exploit elaborate techniques (such as sparse embedding techniques) recently developed to reduce the memory requirement and complexity of algorithms addressing various problems in linear algebra.

Appendix A. Appendix

A.1. Proof of Theorem 1

We can split $\Phi$ as follows:

$$\Phi = \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M + \Phi - \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M.$$

The power method can find V such that ||V’t(Vi : fl.)_ L || 2 = o(l) when s^v^.v^MiMW = 
which is shown in Lemma 11 of Yun et al. (2014). Since 

$$\|\Phi - \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M\|_2 \le \|E[\Phi] - \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M\|_2 + \|\Phi - E[\Phi]\|_2$$
$$\le \|\delta^2\,\mathrm{diag}(M^\dagger M)\|_2 + \|\delta^2(I - V_{1:k}V_{1:k}^\dagger)M^\dagger M\|_2 + \|\Phi - E[\Phi]\|_2$$
$$\le \delta^2 m + \delta^2 s_{k+1}^2(M) + \|\Phi - E[\Phi]\|_2, \qquad (3)$$

in the remaining part, we transform $\Phi - E[\Phi]$ into a sum of random matrices, and then, using the Matrix Bernstein inequality, we obtain an upper bound for $\|\Phi - E[\Phi]\|_2$ to conclude this proof.
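
The power method invoked at the start of this proof can be pictured with a generic orthogonal iteration on a symmetric matrix. The sketch below is a textbook version written under simplifying assumptions (dense input, fixed number of iterations, seeded Gaussian start); it is not the exact routine analysed in Yun et al. (2014), and the names `orthonormalize` and `power_method` are hypothetical.

```python
import math
import random


def orthonormalize(cols):
    """Modified Gram-Schmidt on a list of column vectors."""
    out = []
    for v in cols:
        v = v[:]
        for q in out:
            c = sum(a * b for a, b in zip(q, v))
            v = [a - c * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        out.append([a / norm for a in v])
    return out


def power_method(Phi, k, iters=100, seed=0):
    """Orthogonal iteration: approximate a basis of the top-k eigenspace
    of the symmetric matrix Phi (list of rows). Returns k column vectors."""
    d = len(Phi)
    rng = random.Random(seed)
    cols = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    cols = orthonormalize(cols)
    for _ in range(iters):
        # Multiply every current basis vector by Phi, then re-orthonormalize.
        cols = [[sum(Phi[r][c] * v[c] for c in range(d)) for r in range(d)]
                for v in cols]
        cols = orthonormalize(cols)
    return cols


Phi = [[2.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
top = power_method(Phi, 1)   # close to plus or minus the second basis vector
```

The iteration converges quickly when the $k$-th and $(k+1)$-th eigenvalues are well separated, which is the role played by the spectral gap condition in the statement above.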

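Recall that $A^i$ is the $i$-th row of $A$ and

$$\Phi - E[\Phi] = \sum_{i=1}^{m}\left((A^i)^\dagger A^i - \mathrm{diag}((A^i)^\dagger A^i) - E[(A^i)^\dagger A^i - \mathrm{diag}((A^i)^\dagger A^i)]\right).$$

Let $X^{(i)} = (A^i)^\dagger A^i - \mathrm{diag}((A^i)^\dagger A^i) - E[(A^i)^\dagger A^i - \mathrm{diag}((A^i)^\dagger A^i)]$. Then $X^{(i)}$ is a self-adjoint $\ell \times \ell$ matrix and $E[X^{(i)}] = 0$.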

The Matrix Bernstein inequality (Theorem 6.1 in Tropp (2012)) is a matrix concentration inequality for the sum of zero mean random matrices.

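Proposition 8 (Matrix Bernstein) Consider a finite independent random matrix set $\{X^{(i)}\}_{1\le i\le m}$, where every $X^{(i)}$ is self-adjoint with dimension $n$, $E[X^{(i)}] = 0$, and $\|X^{(i)}\|_2 \le R$ almost surely. Let $\rho^2 = \|\sum_{i=1}^{m} E[X^{(i)}X^{(i)}]\|_2$. Then,

$$P\left\{\Big\|\sum_{i=1}^{m} X^{(i)}\Big\|_2 \ge x\right\} \le n\exp\left(\frac{-x^2/2}{\rho^2 + Rx/3}\right).$$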

In order to use the Matrix Bernstein inequality, we have to find upper bounds for $\|X^{(i)}\|_2$ and $\rho^2$. Since the entries of $A^i$ are independently sampled with probability $\delta$, $[X^{(i)}]_{uv}$ takes some constant value if both $u$ and $v$ are sampled in $A^i$, and is $O(\delta^2)$ otherwise. Using these, the following lemmas bound $\|X^{(i)}\|_2$ and $\rho^2$.

Lemma 9 When $n = \omega(1)$, for $1 \le i \le m$, there exists a constant $C_1$ such that

$$\|X^{(i)}\|_2 \le C_1\max\{1, \delta\ell\}.$$


Proof: Since the number of non-zero entries of $A^i$ is bounded by $\max\{10, 10\delta\ell\}$, we can easily compute $r_u = \sum_{v=1}^{\ell}|[X^{(i)}]_{uv}| \le \max\{10, 10\delta\ell\} + \delta\ell$ for all $1 \le i \le m$ and $1 \le u \le \ell$. By the Gershgorin circle theorem, therefore, for all $i$,

$$\|X^{(i)}\|_2 \le \max\{10, 10\delta\ell\} + \delta\ell.$$
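
The Gershgorin argument used in this proof says that, for a self-adjoint matrix, the spectral norm is at most the largest absolute row sum. The following sketch checks this numerically on a small symmetric matrix; the helper names and the test matrix are hypothetical, and the spectral norm is only estimated by a simple power iteration.

```python
import math


def gershgorin_bound(X):
    """Largest absolute row sum; for self-adjoint X it bounds the spectral norm."""
    return max(sum(abs(x) for x in row) for row in X)


def spectral_norm_sym(X, iters=500):
    """Estimate ||X||_2 for a symmetric matrix by power iteration.

    The returned value ||X v|| with ||v|| = 1 never exceeds the true norm.
    """
    d = len(X)
    v = [float(i + 1) for i in range(d)]   # arbitrary starting vector
    est = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
        v = [sum(X[r][c] * v[c] for c in range(d)) for r in range(d)]
        est = math.sqrt(sum(a * a for a in v))
    return est


X = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
bound = gershgorin_bound(X)       # row sums are 3, 4, 3
estimate = spectral_norm_sym(X)   # largest eigenvalue is 2 + sqrt(2)
```

The estimate stays below the row-sum bound, which is all the proof needs: bounding each row sum of $X^{(i)}$ bounds its spectral norm.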

Lemma 10 There exists a constant $C_2$ such that

$$\Big\|\sum_{i=1}^{m} E[X^{(i)}X^{(i)}]\Big\|_2 \le C_2\, m\max\{\delta^2\ell, \delta^3\ell^2\}.$$

Proof: Since the number of non-zero entries of $A^i$ is bounded by $\max\{10, 10\delta\ell\}$, every $|E[X^{(i)}X^{(i)}]_{uv}| = O(\delta^2(1 + \delta\ell))$ when $u \ne v$ and every $|E[X^{(i)}X^{(i)}]_{uu}| = O(\delta^2\ell(1 + \delta\ell))$. By the Gershgorin circle theorem, therefore

$$\Big\|\sum_{i=1}^{m} E[X^{(i)}X^{(i)}]\Big\|_2 = O(\delta^2 m\ell(1 + \delta\ell)).$$


Let $C = 16\max\{C_1, C_2\}$. From Lemmas 9 and 10 and Proposition 8,

P11|$ — E[<&]|| 2 > y/Clog(n) max{l, 5 2 m£, 5 3 m £ 2 }| < (4) 


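Proof of Theorem 1: This proof starts with

$$\Phi = \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M + \Phi - \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M = \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M + Y,$$

where $Y = \Phi - \delta^2 V_{1:k}V_{1:k}^\dagger M^\dagger M$. From (3) and (4),

$$\|Y\|_2 \le \delta^2 m + \delta^2 s_{k+1}^2(M) + \sqrt{C\log(\ell)\max\{1, \delta^2 m\ell, \delta^3 m\ell^2\}}$$
$$= o(\delta^2 s_k^2(M)) + \sqrt{C\log(\ell)\max\{1, \delta^2 m\ell\}},$$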


where the last 


equality stems from s k (M ) = oj(m) and A+fiM) = w (l) 


the conditions of this theo¬ 


rem. Since the condition 5 ^4—1 = w(l) implies 5 2 m£ = t o(k 2 \og£) and s 

m£ log i \ ) V \ 6 ! ^Jc log(£) max{l ,5 2 mC} 


^C\og(l)8 2 m£ 

o(l) from Lemma 11 of Yun et al. (2014). 


£=^== = = w(l), we can deduce Sk ^ ~ = w (l)- Therefore, ||(Vi : / c )^(V r )_L|| = 


A.2. Proof of Lemma 2 


Let $F = U_{1:k}^\dagger M^{(B)}$ and $G = (I - U_{1:k}U_{1:k}^\dagger)M^{(B)}$. We find a lower bound for $s_k(F)$ and an upper bound for $s_1(G)$ using the matrix Chernoff bound (Theorem 2.2 in Tropp (2011)).

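Proposition 11 (Matrix Chernoff) Let $\mathcal X$ be a finite set of positive-semidefinite matrices with dimension $d$ that satisfy $\max_{X\in\mathcal X} s_1(X) \le \alpha$. Let

$$\beta_{\min} = \frac{\ell}{|\mathcal X|}\, s_d\Big(\sum_{X\in\mathcal X} X\Big) \quad\text{and}\quad \beta_{\max} = \frac{\ell}{|\mathcal X|}\, s_1\Big(\sum_{X\in\mathcal X} X\Big).$$

When $\{X^{(1)}, \dots, X^{(\ell)}\}$ are sampled uniformly at random from $\mathcal X$ without replacement,

$$P\Big\{s_1\Big(\sum_{i=1}^{\ell} X^{(i)}\Big) \ge (1+\epsilon)\beta_{\max}\Big\} \le d\left(\frac{e^{\epsilon}}{(1+\epsilon)^{1+\epsilon}}\right)^{\beta_{\max}/\alpha} \quad\text{for } \epsilon \ge 0 \text{ and}$$
$$P\Big\{s_d\Big(\sum_{i=1}^{\ell} X^{(i)}\Big) \le (1-\epsilon)\beta_{\min}\Big\} \le d\left(\frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}}\right)^{\beta_{\min}/\alpha} \quad\text{for } \epsilon \in [0,1).$$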
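
i) $s_k(F)$: $FF^\dagger$ is the sum of $\ell$ matrices which are sampled uniformly at random from $\mathcal X = \{U_{1:k}^\dagger M_1(U_{1:k}^\dagger M_1)^\dagger, \dots, U_{1:k}^\dagger M_n(U_{1:k}^\dagger M_n)^\dagger\}$ without replacement, where the matrix dimension is $k$. We can obtain the other parameters to compute the matrix Chernoff bound as follows: $\alpha = m$ since $\|M_i\|_2^2 \le m$ for all $1 \le i \le n$, and $\beta_{\min} = \frac{\ell}{n}s_k^2(M)$. From Proposition 11,

$$P\Big\{s_k(FF^\dagger) \le (1-\epsilon)\frac{\ell}{n}s_k^2(M)\Big\} \le k\left(\frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}}\right)^{\frac{\ell s_k^2(M)}{mn}} \quad\text{for } \epsilon \in [0,1).$$

Therefore, when $s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$,

$$P\Big\{s_k(F) \le \sqrt{\frac{\ell}{2n}}\,s_k(M)\Big\} \le \frac{1}{m}.$$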


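ii) $s_1(G)$: $GG^\dagger$ is the sum of $\ell$ matrices sampled uniformly at random without replacement from

$$\mathcal X = \{(I - U_{1:k}U_{1:k}^\dagger)M_1((I - U_{1:k}U_{1:k}^\dagger)M_1)^\dagger, \dots, (I - U_{1:k}U_{1:k}^\dagger)M_n((I - U_{1:k}U_{1:k}^\dagger)M_n)^\dagger\}.$$

Here, the dimension is $m$, $\alpha = m$ and $\beta_{\max} = \frac{\ell}{n}s_{k+1}^2(M)$. From Proposition 11,

$$P\Big\{s_1(GG^\dagger) \ge (1+\epsilon)\frac{\ell}{n}s_{k+1}^2(M)\Big\} \le m\left(\frac{e^{\epsilon}}{(1+\epsilon)^{1+\epsilon}}\right)^{\frac{\ell s_{k+1}^2(M)}{mn}} \quad\text{for } \epsilon \ge 0.$$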


Whenwesete* = max{2, ||^|^},P{si(GGt) > (1 + e*)|4+i( M )} < ^ and (1+e*) s 2 k+1 (M) < 
max{3s| +1 (M), 3mw ] ogm }. Therefore, 


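$$\frac{s_k(U_{1:k}U_{1:k}^\dagger M^{(B)})}{s_1((I - U_{1:k}U_{1:k}^\dagger)M^{(B)})} = \omega(1),$$

since $s_k^2(M) = \omega\left(\frac{mn\log m}{\ell}\right)$ and $\frac{s_k(M)}{s_{k+1}(M)} = \omega(1)$.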


A.3. Proof of Theorem 4 

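We can rewrite $(A^{(1)})^\dagger W$ as follows:

$$(A^{(1)})^\dagger W = E[(A^{(1)})^\dagger]W + ((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W$$
$$= V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W + (I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]W + ((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W.$$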

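In the above equation, the columns of $V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W$ span the space that we want to recover, and the remaining part is noise. Thus, we can easily recover $\hat V$ satisfying $\|V_{1:k}^\dagger\hat V_\perp\|_2 = o(1)$ when

$$\frac{s_k(V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W)}{\|(I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]W\|_2 + \|((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W\|_2} = \omega(1). \qquad (5)$$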

Before giving the proof of (5) to conclude the proof of Theorem 4, we introduce key lemmas. 

Lemma 12 finds a lower bound for $s_k(V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W)$ and an upper bound for $\|(I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]W\|_2$, and Lemma 13 induces an upper bound for $\|((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W\|_2$.

Lemma 12 When S |(M) = ^§5

probability, 


w(l), and 


mn 2 (fc+log £) 


u;(l), with high 


s k (V 1:k vl k E[(A^]W) 


n 



and 


IK/-L 1 ;t L 1 t l )E[(/ll 1 ')t]H '|| 2 




Proof: The proof is given in Section A.4. ■ 

Lemma 13 For given $Q$ and $A^{(B_2)}$, $E[\|((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W\|_F^2] = O(\delta^2 kmn)$.

Proof: Since every entry of $A^{(1)}$ is randomly sampled with probability $\delta/4$ and $A^{(1)}$ and $W$ are independent, for all $1 \le i \le n$ and $1 \le j \le k$,

$$E\Big[\big([((A^{(1)})^\dagger - E[(A^{(1)})^\dagger])W]_{ij}\big)^2\Big] = E\Big[\Big(\sum_{u=1}^{m}[A^{(1)} - E[A^{(1)}]]_{ui}[W]_{uj}\Big)^2\Big] \le \frac{\delta}{4}\|W_j\|_2^2 = O(\delta^2 m),$$

where the last equality stems from the trimming process on $A^{(B_2)}$. Thus,

$$E[\|(A^{(1)} - E[A^{(1)}])^\dagger W\|_F^2] = O(\delta^2 kmn).$$


Proof of Theorem 4: When - -^fjr - = tu(l), from Lemma 12, Lemma 13, and the Markov in¬ 
equality, ^ | w ith high probability. Let V = V'Y,'{U' f be the singular value 

decomposition of V. Since 

IIU - v l:k vl k )v I| 2 >||(/ - v 1:k vi :k )v'\\ 2 8 k (v) = 


and s k (V) = 
Sk(Vl:kVLV) 

\\(I-Vv.kVl k )V\\ 


£l(s k (Vi: k V^. k V)) from the Lidskii ineuality Sfc + i(A + B) > Sfc(A) — 
- = cu(l) implies || {Vi:k)\VW 2 = o(l). Therefore, with high probability, 

ll^fc^klh = V 1 - sftVLV ) = \\(Vl:k){V'\\ 2 = 0(1). 


S k+ i(A), 


A.4. Proof of Lemma 12 

Since $W = A^{(B_2)}Q = E[A^{(B_2)}]Q + (A^{(B_2)} - E[A^{(B_2)}])Q$, we find a lower bound for $s_k(V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W)$ and an upper bound for $\|(I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]W\|_2$ from

$$s_k(V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]W) \ge s_k(V_{1:k}V_{1:k}^\dagger E[(A^{(1)})^\dagger]E[A^{(B_2)}]Q) - \|(E[A^{(1)}])^\dagger((A^{(B_2)} - E[A^{(B_2)}])Q)\|_2 \quad\text{and}$$
$$\|(I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]W\|_2 \le \|(I - V_{1:k}V_{1:k}^\dagger)E[(A^{(1)})^\dagger]E[A^{(B_2)}]Q\|_2 + \|(E[A^{(1)}])^\dagger((A^{(B_2)} - E[A^{(B_2)}])Q)\|_2. \qquad (6)$$


Key lemmas: The following lemmas bound each element of the above inequalities. To show the lem¬ 
mas, we use Corollary 3: = o(l) with high probability when cr|(M) = u>( mnl ° s ’ n ), 

Sk(M) /-I \ J S 2 (sf.(M) _ s 

Sfc+l (M) 1 mn 2 log t ' /' 

Lemma 14 When s 2 k (M ) = iu( mnl ° gm ), = w ( 1 )’ and = w ( 1 )> with hi g h prob¬ 

ability, 

^(Fi: fc ^ fe E[(^( 1 ))t]E[^l^)]Q) = 0 L 2 sl(M)^\ . 


Proof: Since every entry of W ,S "P and ,4 ! 1 ' is randomly sampled with probability 5/ 4, we know 
that E[(4lW)t] = | VTiU^ and E[2l( S2 )] = | UT,(V 1:i y. Under the conditions of this lemma, 

from Corollary 3 |(U 1: ^)tQ ± || = 0 (i) and from Lemma 2 s k {U\. k M^) > ^ s k (M ) with high 
probability. Let U iB) be the k x k matrix satisfying V^: k = V 1:£ R( B \ Then, 


Sfe (U 1:fe U 1 ^E[(^ 1 ))t](E[^ S2 )]Q)) = —s k (V 1:k vl k M^M^Q)) 

= ^s k (v 1 :fc E^E^(F 1 :£ )tg)) 

> ^ Sfc (M) Sfc (E^(U 1 :£ )tQ)) 

= ^s k (M)s k (X\- k k (R^) HV^Q)) 

= n[S 2 s 2 (M) 


where the last equality stems from the fact that Sk{T,\: k (R^)^) = Sk{WR B ^) and Sk{{V v1 )^Q)) = 

l-o(l)- ■ 

Lemma 15 When s 2 k (M ) = u{ mnl ° gm ), = w ( 1 )’ and = ^C 1 )’ with high prob¬ 

ability, 


mn 2 log t 


IIU - V 1;fc U^)E[(4lW)t]E[^)]Q|| 2 = o UsliM)^ ) . 


Proof: Since E[(A( 1 ))t] = fUEC/t andE[2l( B2 )] = |US(U 1:£ )t, 


(■ I - n*q*)E[(<4 (1) ) , ]E[/l (Ba) ]« = l4 + i : „ Am F i t + i,„A™E[(^ <1 >) t ]E[a< B =)]Q 

Under the conditions of this lemma,
from Lemma 2. Therefore, 


s i(Uh:„a™ m(B) ) = with high probability 


sMI - V' 1:t \/ 1 » t )E[( J 4< 1 ))t]E[Al< fl >>]Q) = (Pt+l:„A m S aITUa.A".®) 

<^» t+ i(M) 51 (E‘+;:^™^ + 1 : „ Am Q) 

<5 2 




16 

where the last equality stems from the fact that Elm) = w( 1) and si(Sj;+^^Uj +1:nAm ) = 
^l(^ + l:nA m ^ (S) ) = o{s k {M)y/Ifr). U 

Lemma 16 With probability 1 — 1/5, || (E[t4^ 1) ])^'((a4^ 2 ) — E[4' B2 )])Qi:fc)ll2 = 0(y/5 2 kmn). 

Proof: Since entries of $A^{(B_2)}$ are randomly sampled with probability $\delta/4$ and independent of $Q$, for all $1 \le i \le n$ and $1 \le j \le k$,


E 


([(E[^( 1 >])t((^(^) - E[^(^ 2 )])Q)]^) ; 


m t 


=E 


L v 4 




5 2 


U= 1 V=1 

m t 


E E - m^-% 


U— 1 V=1 
2 m 


U= 1 V=1 


From the above inequality, E[||(E[2l( 1 )])1"((4 B2 ' ) — E[A( B2 )])(5)IIf] = (f ) 3 kmn. Therefore, by 
the Markov inequality, we conclude this proof. ■ 

Proof of Lemma 12: When = "M’ = Insertin S Lemma 14, 

Lemma 15, and Lemma 16 into (6), therefore, we conclude this proof: 

SkiV^V^EKA^W) = n(d 2 s k {M^M) ] l? ) j and 

II (I-V 1 :k vl k )E[(A^}W \\ 2 = 

A.5. Proof of Theorem 5

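Let $P_{\hat V} = \hat V\hat R\hat R^\dagger\hat V^\dagger$, which is an orthogonal projection matrix onto the linear span of $\hat V$. Then, $\hat U\hat V^\dagger = \frac{4}{\delta}A^{(2)}P_{\hat V}$. We can bound $\|[\hat U\hat V^\dagger]_0^1 - M\|_F^2$ using the projection $P_{\hat V}$ as follows:

$$\|[\hat U\hat V^\dagger]_0^1 - M\|_F^2 = \Big\|\Big[\Big(M + \frac{4}{\delta}\Big(A^{(2)} - \frac{\delta}{4}M\Big)\Big)P_{\hat V}\Big]_0^1 - M\Big\|_F^2$$
$$\le \Big\|\Big(M + \frac{4}{\delta}\Big(A^{(2)} - \frac{\delta}{4}M\Big)\Big)P_{\hat V} - M\Big\|_F^2$$
$$\le 2\|MP_{\hat V} - M\|_F^2 + 2\Big\|\frac{4}{\delta}\Big(A^{(2)} - \frac{\delta}{4}M\Big)P_{\hat V}\Big\|_F^2$$
$$\overset{(a)}{\le} 2\|MP_{\hat V} - M\|_F^2 + o(mn)$$
$$\le 4\|U_{1:k}U_{1:k}^\dagger(MP_{\hat V} - M)\|_F^2 + 4\|(I - U_{1:k}U_{1:k}^\dagger)(MP_{\hat V} - M)\|_F^2 + o(mn)$$
$$\overset{(b)}{\le} 4\|U_{1:k}U_{1:k}^\dagger(MP_{\hat V} - M)\|_F^2 + o(mn)$$
$$= 4\|U_{1:k}\Sigma_{1:k}V_{1:k}^\dagger(P_{\hat V} - I)\|_F^2 + o(mn)$$
$$\overset{(c)}{=} o(mn),$$

where (a) stems from Lemma 17, (b) uses the fact that $\|(I - U_{1:k}U_{1:k}^\dagger)M\|_F^2 = o(mn)$, and (c) holds since $\|V_{1:k}^\dagger\hat V_\perp\|_2^2 = o(1)$ from Theorem 4.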
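
The projection onto the span of a matrix with $k$ independent columns has two properties that this part of the appendix relies on: it is idempotent, and its squared Frobenius norm equals $k$. The sketch below builds such a projector from an orthonormal basis obtained by Gram-Schmidt and leaves both quantities available for inspection. The function names and the small test matrix are hypothetical.

```python
import math


def projector(V):
    """Orthogonal projector onto the column span of V (list of rows, k columns)."""
    n, k = len(V), len(V[0])
    basis = []
    for j in range(k):
        v = [V[r][j] for r in range(n)]
        for q in basis:                     # modified Gram-Schmidt step
            c = sum(a * b for a, b in zip(q, v))
            v = [a - c * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    # P is the sum of the outer products q q^T over the orthonormal basis.
    return [[sum(q[r] * q[c] for q in basis) for c in range(n)] for r in range(n)]


def frobenius_sq(P):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in P for x in row)


V = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
P = projector(V)
print(frobenius_sq(P))   # equals the rank k = 2 up to rounding
```

Writing the projector through an orthonormal basis mirrors the role of the Gram-Schmidt matrix in the definition above: once the columns are orthonormalized, the projector is a sum of $k$ rank-one terms.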

Lemma 17 When $n = \omega(k/\delta)$, with high probability, $\|\frac{4}{\delta}(A^{(2)} - \frac{\delta}{4}M)P_{\hat V}\|_F^2 = o(mn)$.


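Proof: Since entries of $A^{(2)}$ are randomly sampled with probability $\delta/4$ and independent of $\hat V$, for all $1 \le i \le m$ and $1 \le j \le n$,

$$E\Big[\big([(A^{(2)} - \tfrac{\delta}{4}M)P_{\hat V}]_{ij}\big)^2\Big] = E\Big[\Big(\sum_{v=1}^{n}[A^{(2)} - \tfrac{\delta}{4}M]_{iv}[P_{\hat V}]_{vj}\Big)^2\Big] = \sum_{v=1}^{n}[P_{\hat V}]_{vj}^2\,E\big[([A^{(2)} - \tfrac{\delta}{4}M]_{iv})^2\big] \le \frac{\delta}{4}\sum_{v=1}^{n}[P_{\hat V}]_{vj}^2.$$

Since $\sum_{j=1}^{n}\sum_{v=1}^{n}[P_{\hat V}]_{vj}^2 = k$, from the above inequality,

$$E\Big[\Big\|\frac{4}{\delta}\Big(A^{(2)} - \frac{\delta}{4}M\Big)P_{\hat V}\Big\|_F^2\Big] \le \frac{16}{\delta^2}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\delta}{4}\sum_{v=1}^{n}[P_{\hat V}]_{vj}^2 = \frac{4km}{\delta}.$$

Therefore, by the Markov inequality, we conclude this proof.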